A Canonical Measure of Allelic Association 



Markus Scholz Dirk Hasenclever 

Institute for Medical Informatics, Statistics and Epidemiofogy 
University of Leipzig 
Haertelstrasse 16-18 
04107 Leipzig 
Germany 

Corresponding author: 

Markus Scholz 
University of Leipzig 

Institute for Medical Informatics, Statistics and Epidemiology 
Haertelstrasse 16-18 
04107 Leipzig 
Germany 

Telephone: -f49 341 97 16190 
Fax: +49 341 97 16109 

Running head: Canonical measure of association 



Abstract 

The measurement of biallelic pair-wise association called linkage disequilibrium (LD) is an important 
issue in order to understand the genomic architecture. A large variety of such measures of association 
in two by two tables have been proposed in the literature. 

We propose and justify six biomctrical postulates which should be fulfilled by a canonical measure of 
LD. In short, LD measures are defined as a mapping of two by two probability tables to the set of real 
numbers. They should be zero in case of independence and extremal if one of the entries approaches 
zero while the marginals are positively bounded. They should reflect the symmetry group of two by 
two tables and be invariant under certain transformations of the marginals (selection invariant). There 
scale should be maximally discriminative for arbitrary tables, i.e. have maximum entropy relative to 
a calibrating symmetric distribution on the manifold of two by two probability tables. 
None of the established measures fulfil all of these properties in general. We prove that there is a unique 
canonical measure of LD for each choice of a calibrating symmetric distribution on the set of probability 
tables. An explicit formula of the canonical measure is derived for Jeffreys' non-informative Dirichlet 
prior distribution and the uniform Dirichlet distribution. We compare the canonical LD measures with 
other candidates from the literature. Based on the empirical distribution of association encountered in 
typical SNP data we recommend the canonical measure derived from Jeffreys' non-informative prior 
distribution when assessing linkage disequilibrium. Respective R-procedures are available on request. 
In a second part, we consider various estimators for the theoretical LD measures discussed and compare 
them in an extensive simulation study. The usual plug-in estimators based on frequencies can lead to 
unreliable estimates. Estimation functions based on the computationally expensive volume measures 
were proposed recently as a remedy to this well-known problem. We confirm that volume estimators 
have better expected mean square error than the naive plug-in estimators. But they are outperformed 
by estimators plugging-in easy to calculate non-informative Bayesian probability estimates into the 
theoretical formulae for the measures. 



Keywords: allelic association, Dirichlct distribution, linkage disequilibrium, maximum entropy, two 
by two contingency tables 



1. Introduction 



1.1. Background 



Modern genetic high-through-put methods increasingly provide medium to large size data sets that 
consist of high dimensional vectors of binary markers. We have been particularly motivated by the 
example of SNP-chips that address up to one million of biallelic single nucleotide polymorphisms 
(SNPs). Another example of this data type is patterns of genomic aberration in tumours that can 
be measured based again on SNP-chip technology or by matrix competitive genome hybridisation 
(mCGH). 

We restrict ourselves to one sample problems as opposed to two or more sample problems encoun- 
tered in the context of disease association case-control studies. The focus is to detect highly linked 
pairs of markers. In the case of SNPs this kind of association is called linkage disequilibrium (LD). 
Highly linked SNPs are interpreted to be inherited together. LD indicates that a recombination event 
between the two sites was rare in the population under study. However, there may be other reasons 
for high LD such as admixture or selection. Linkage has been analysed to understand the genomic 
architecture especial l y with respect to re combination hot-spots and jointly inherited haplotype blocks 



( Schulze et al 



2004 



Service et al 



2006^ . In the following we always restrict ourselves to LD between 
two biallelic markers. 

A basic step in analysing such data is assessing associations between markers in a very large 
number of two by two tables and comparing as sociations between tables. A bewildering plethora o f 



measures of association are used in the literature (jDevlin fc Risch , 



1995 



Some suggestions on the preferred use of single measures were made 



Hedrick 



1987 



Devlin fc Risch 



Thomas 



1995 



2004) 



Mueller 



20041 ) . Most of thes e arguments ar e based on biological issues such as dependence on allelic frequencies 



and rate of decay 



Hedrick 



(Pritchard fc Przeworski 



19871 ) or on practical applications such a s correlation of test statistics 



20011 ) and determination of haplotype blocks ([Gabriel et al 



20021 ) 



After a short review of different LD measures, we propose and justify biometrical and statistical 



postulates to choose between measures of association in the one sample case. We conclude that 
none of the established LD measures fulfil all of the desirable properties in general. Wc construct a 
family of canonical linkage disequilibrium measures which fulfil all of our postulates. Family members 
differ in the choice of a symmetric Dirichlct distribution on the set of all two by two contingency 
tables. These Dirichlet distributions calibrate the scale of the measure which essentially measures 
the extremacy of LD relative to the given distribution. The new measures are compared with the 
established once. Finally, the problem of estimation of the new measure is addressed and different 
estimators are compared in a simulation study. 

1.2. Measures of Linkage Disequilibrium 

We consider to analyse contingency tables of two biallelic markers at one strand of the genome. Let T 
be the manifold of all tetranominal probability models written as a two by two table of probabilities: 
T consists of all two by two matrices t with entries pij £ K, £ {0,1}) fulfilling the properties 
Pij > 0, '^ijPij = 1. The Pij denote the probabilities of the corresponding combination of the two 
alleles of the markers i and j. In the following, we abbreviate X]i=o Sj=o ^ X^i jj Pi- = Pio Pn ^^d 
P.j = Paj + Pij for convenience. Here, the marginals pi, and p,j denote the frequencies of the alleles of 
the two markers. 

Statistically, a measure of LD is simply a measure of association in the contingency table t. The 
following measures were defined in literature: 

D: The measure D is the absolute deviation of the observation from the expectation that the alleles of 
marker i are randomly combined with alleles of marker j under the assumption of constant marginals. 
Hence: 

D = Poo - Po.P.o 

This measure is zero in case of independence of the markers but extremal values depend on the 
marginals. 



Lewontin's D (|Lewontiij . 119631 ): The widely used measure D is a standardisation of the original 
measure D: 



D' 



D 



where _D„ 



min{po.P.i,P.oPi.} if D>0 
min{po.P.o,Pi.P.i} if D<0 



Lewontin's D' ranges from —1 to 1 and tends to these values if one of the pij tends to zero while the 
marginals are bounded a way from zero. 



Correlation coefficient r ([Hill fc Robertson 



19681 ): The usual correlation coefficient applied to binary 
data has similar popularity as D' . It also ranges from —1 to 1 where an absolute value of 1 is obtained 
when a diagonal of t tends to zero: 
D 



V Po.P.oPi.pl 



Odds ratio A (Edwards 



1963): 



A 



PooPii 



POlPlQ 

The odds ratio is the first quantity which is not directly dependent on D and the marginals. It is well 
known that A is independent of selection of single rows or columns of the table t. It is thus often used 
analysing (two sample) case-control studies. The odds ratio is extremal if one of the pij tends to zero 
while the marg i nals a re bounded away from zero. 



Yule 's Q (jYule 



19001): 



Q = 



A - 1 
A + 1 



Since the common odds ratio A is not standardised, this quantity has been defined as a function of A 
which is bounded to [—1, 1]. It can also be wri tten as a difference of the two conditional probabilities 



POOPll 



POOPll+POlPlO 



and 



POlPlO 



poopii+a oipio 



Mutual information MI 



H 



Li et al 



art una 



200 



1991) 



Weaver fc Shannonl . 



1963 ) 



MI = ^ Pij l0g2 Py - ^ P,. l0g2 Pt. - ^ P.J l0g2 P.J 



MI has its minimal value zero when poo = Po.P.o a-nd its maximal value one only if t is diagonal with 
either poo = Pii = 5 or poi = pio = ^. Hence, MI is not normaliz ed. 



2000 1. Rather than a 



A further measure Dvol has been proposed by Chen et AL. (jChen et al. 
new measure of LD it is an alternative estimation function for D' . Dvol is defined as the fraction of 
contingency tables with less extreme D in the space of all contingency tables with same sign of D and 
fixed marginals. 

The use of these measures has been discussed extensively and it has been recommended to calculate 
r when one marker is used to predict another m arker and to use D' as a measure of recombination 



probability (jPevlin fc Risch 



1995 



Mueller 



2OOJ) . However, all these recommendations lack of a clear 
definition of desired statistical properties of a measure of LD. Hence, the use and interpretation of 
these measures remain vague. 



2. The canonical measure of Linkage Disequilibrium 

2.1. Postulates for a Canonical Measure of Linkage Disequilibrium 

In this section we list postulates for a canonical measure of LD giving both biological and mathematical 
justifications. 

PI (Domain of association measure): A measure of LD is a continuous function 77 : T ^ K. 

The LD measure ry is formally defined on the manifold T of tetranomial probability models (with two 
by two lay-out) . not on a set of concrete realisations (for example as two by two data tables of sample 
size N). Defining the LD measure and estimating it from concrete data are radically separate tasks. 

P2 (Lack of association and complete linkage disequilibrium): Tables in linkage equilibrium 
show no association that is they fulfil pij = Pi.p.j G {0,1}). Complete LD is present 
whenever at least one pij approaches zero while the marginals retain a positive lower limit. 



When a new SNP emerges in a population by a single mutation event, the new allele is exclusively found 
in conjunction with only one of the two alleles of an already existing SNP. As long as no recombination 
event occurs, the new SNP remains in complete LD with the other SNP. The corresponding two by two 
table features a single zero cell. The measure r discussed above becomes extremal only if the underlying 
table approaches diagonal form. A diagonal structure suggests population admixture, specific negative 
selection or a mutation event giving rise to both SNPs simultaneously. The postulate covers all these 
cases. 

P3 (Symmetry): The LD measure rj reflects the two by two symmetry structure of T: 

a) r] is invariant under permutation of the SNPs which means matrix transposition. 

b) ry changes sign when alleles of a SNP, which are rows or columns of the table, are transposed. 

Since we consider the one-sample case none of the binary markers is distinguished. SNP transposition 
and allele transposition generate a symmetry group (a dihedral group D^) that should leave a measure 
of association essentially invariant. 

Requiring antisymmetry and thus introducing a sign to the linkage diseq uilibrium measu re is convenient 



in applications where we have well defined marked states of the markers (jThomasl . 120041 ). Alternatively, 
one may just take the absolute value of 77 to obtain a measure invariant under the full symmetry group. 
Note that P3 implies that tables in linkage equilibrium arc mapped to zero, since transposing preserves 
the condition of no association. 

P4 (Selection invariance:) The LD measure 77 is not changed by selection of alleles of one SNP 
that does not affect the corresponding allele frequency ratios of the other SNP. In other words, 
multiplication of columns or rows by positive numbers and renormalisation of the table does not 
change the measure of association. 

Selection of alleles and genetic drift is common during the course of evolution in a population. Allele 
frequencies fluctuate and differ in distinct populations. The measure of LD should capture an intrinsic 
property of the genome architecture that reflects the structural probability of a recombination event 



between two markers. It should thus not depend on the marginal allelic frequencies that happen to be 
observable in a given population. 

In addition, there may occur sampling selection in obtaining the data that introduces bias. Using a 
selection invariant measure of association safeguards against this danger up to a certain degree. 
Selection invariance is particularly important if the measure of association is intended to be meaning- 
fully compared between tables with markedly different marginal distributions (allele frequencies). 

P5 (Standardization): The LD measure rj is standardized to values in (—1,1). The extremal values 
±1 stand for perfect LD. To achieve uniqueness we require that T]{t) — > 1 for t — > ("^^ p"^) 

P6 (MsLximum Entropy): The LD measure rj should classify arbitrary tables of T as discrimina- 
tively as possible. To formalise the notion of arbitrary tables we choose a symmetric "non- 
informative" distribution on T and require that the induced distribution of the LD measure rj 
has maximum entropy. That is it is uniform on (—1, 1). 

Postulates 1 to 5 do not determine yet the scale for association values between (independence) and 
the extremal values ±1. The sixth postulate requires using a most informative and discriminative 
scale. The LD measure 77 should be a good general classifier of arbitrary tables of T. No particular 
value of the LD measure rj within (—1, 1) should be privileged. Therefore, we require that 77 have a 
uniform (maximum entropy) distribution on (—1,1) when sampling arbitrary tables from T. 

Postulate 6 implies the choice of a symmetric calibrating distribution on T to specify sampling 
"arbitrary" tables. In theorem 2 we will see that rj has an intuitive interpretation based on the 
proportion of tables having less extreme LD than the given one in the chosen distribution. 
Later we calculate the LD measure ry for Dirichlct distributions V (a), a ~ (aooj Q^oii QiiOi c^ii) («ij > 0). 
This family of distributions is often used in a Bayesian context as prior di s tribu t ion for contin gency 
tables since it is the conjugate prior to the multinomial distribution 



Geisser 



1984 



WallevI 



19961). The 



density of the Dirichlet distribution is given by: 



is the Beta-function and F denotes the Gamma function. 

The Dirichlet distributions are symmetric (invariant under transpositions of columns or rows) if and 
only if all cty are equal. For simplicity of notation, we identify the vector a with one of its components 
in the symmetric case. 

A principled choice for a symmetric non-informative distribution on T to define the canonical LP 



1984 : 



Jeffreys 



196lh . 



measure rj may be the well-known non-informative Jeffreys' prior a = ( Geissei 
In addition, with D (i) the distribution of the minor marginal frequencies is uniform on [0; 0.5] which 
in our experience is often encountered in SNP-array-data. We also discuss the choices a = I and a — 2. 



2.2. Construction of a Canonical Measure of Linkage Disequilibrium 

We will now explore the consequences of this set of postulates. In particular, we will show that for any 
continuous symmetric distribution on T the canonical LD measure rj exists and is unique. Later we will 
calculate 77 for symmetric Dirichlet distributions. We start with a closer look at selection invariance. 
Selection invariance may be formalised as the action of a suitable group G on T: Consider the group 
G = (M+ X IR+, •) with component- wise multiplication. 
For every {fj,, v) e M+ x M+ we define a map: ^(/i, v) : T — > T 



t 



Poo Poi 

PlO Pll 



1 



Aii^Poo + MPoi + vpw + Pll 



V 



(2.1) 



vpio Pll 

Since g{fi,i') o g{p,',h'') ~ g(fi ■ fi'^v ■ u') and g{\,l) = Idj this defines a G-group action on T. A 
function 77 : T ^ M is defined as selection invariant if rj{t) — rj{g{ii,v){t)) for all {^,v) E K+ x K+. 
Lying in the same group orbit defines an equivalence relation on T: We say two elements ti,t2 € T are 
equivalent ti ^ t2 if and only if there are (^, v) E x M+ with g(p, i^)(ti) — t2- Thus every selection 
invariant function ?; : T ^ M induces a well-defined map 77 : T ^ K. on the quotient space T = T/ ~. 



Theorem 1 (odds ratio): 

a) The odds ratio A : T M; t = ( ^ \(t) 

' ' \P10 Pll j ^ ' 

b) The odds ratio induces a homeomorphism A : T 



is selection invariant. 

POlPlO 



VI 



2-(l + Vl) 2-(l+VI) 



c) The inverse mapping A^^ : M+ T can be described by I 

1 Vi 

\ 2-{l + Vl) 2-{l+-/l) ] 

d) Every selection-invariant function 77 : T ^ R can be written as a function of A, namely 77 
{j]o A^^) o A. 



Proof: a) is easily verified. Every equivalence class [t] in T has a representant with marginals ^, 



namely [(7(y^^ii£ia^ y^ ^^^^"^ )(^)] which has the form given in c). d) is trivial. 



For every distribution T) on T the odds ratio A induces the distribution A.^ (I?) of the correspond- 
ing odds ratios on M+ . 

Theorem 2 (Existence of a canonical LD measure): Let I> be a symmetric (non-informative) 

distribution on T. Let L denote the cumulative distribution function of A, (2?) on R+. 

Define 



T]{t) = 2L(A(0)-1 



Then 77 is the unique canonical LD measure that fulfils postulates P1-P6 chosen T). 



(2.2) 



Proof: Because of theorem 1 the measure rj depends only on the odds ratio A of the table. By 
construction rj is uniform on (—1, 1). The remaining postulates and uniqueness are easily verified. □ 



Remark: Note that this construction provides 77 with an intuitive interpretation. r]{t) is a signed 
measure of extremality of the odds ratio A(^) relative to an underlying calibrating distribution V of 



arbitrary tables on T. It is based on the proportion of tables with less extreme odds ratio in this 
distribution. 



2.3. Calculation of the Canonical Measure of Linkage Disequilibrium for 
Symmetric Dirichlet Distributions 

We determine the distribution function of the odds ratio A under the Dirichlet distribution 'D(a). 
Define 

= {< e T : \{t) < A} VA e M+ 
We calculate 

/ fv{a) = -wr~{ I Poo°^^Poi'^^Pio°^^ C^-Poo-Poi-Pw)""''^ dpoodpoidpio (2.3) 
Using the formula 

Poo 



Pw = (1-poo-Poi)' 



Apoi + Poo 

we transform the coordinates (poOiPoijPio) to (poOjPoi: A) with the corresponding functional determi- 
nant 



9(poo,Poi,Pio) 



5(poo,Poi, A) 



/-, N PooPoi 
(1 - Poo - Poi) [ "2 



(Apoi +Poo) 

Furthermore, fl^ can be parameterized by poo € (0, 1), poi € (0, 1 — poo) and A G (0, A). Hence, (2.3) 
can be written as 

fvia) - riiX)dX (2.4) 

Qa Jo 



where 



A°"-i /-i-poo (1 - Poo -poi)°"^+""~' 

B{a) Jo Jo (Apoi +Poo) 



UA) = W / / '"\. . ^ dpo^dpoo (2.5) 



is the probability density of A under the Dirichlet distribution V (a). 

A plot of this density can be found in figure 1 for some special Dirichlet distributions. 



In the following, we consider symmetric Diriclilet distributions and denote the corresponding 
canonical LD measure with rja- 

In case of a G { 5 1 1 } there are analytic formulae for 7]a which will be derived now. 



Theorem 3 (Analytic formula for 771): 

^^(A) = 2^— VA^l (2.6) 
(A-1)' 

The gap of definition at A = 1 can be removed by taking the limit liniA^i 771 (A) = 0. 



Proof: At first wc solve the double integral for I in (2.5) 

ifw f^-Poo p„„p„^ (1 - Poo - Pox) ^ , ,„„^ 

l(X) ^ 6 / / "2 dpaidpoo (2.7) 

Jo Jo (Apoi +P00) 

Using partial fraction decomposition wc obtain after some calculation 

1{X) - -6 / ^(2A-2Apoo + (2poo + A-Apoo)hi -^^^ 1 rfpoo (2.8) 

Jo I Poo + A - Apoo J 

The only summand in (2.8) causing difficulties is /poo (2poo + A — Apoo) hi (poo + A — Apoo) dpoo- It 

can be solved by substitution of poo (1 — A), where it is necessary to distinguish A > 1 and A < 1. A 

singularity occurs for A = 1 which can be removed separately. After a longer calculation it follows that 

nX) ^ + + VA^l (2.9) 

(A-1)^ 

and /(I) = limA^i/(A) = i. 

With equation (2.9) the third integral can be calculated 

f^l-X + 2\nX In A 
l(X)dX = / 2 5 — H ^dA 

Jo [X-lf [X-lf 

Both summands can be dealt with using partial integration 
A2-A-AlnA 



l(X)dX - 

(A-1)' 



□ 



Theorem 4 (Analytic formula for 



VI (A) 



Iny 



Jo Vy{y-^. 



■dy-1 



(2.10) 



This integ ral can also be expressed in terms of the dilogarithm function dilog x = ~ y\ ^gg g 



Maximoml (j2003l ) for properties of dilog) for which good numerical procedures are available (jKoelbigl ) 



'71 (A) 



In ( VX] In 



\/A- 1 



dilog (Vx) - dilog (-Vx) \ - 1 



VA + 1 

Again, there is a gap of definition at A = 1 which can be resolved by limA-»i rji (A) = 0. 



(2.11) 



Proof: Equation (2.10) follows directly from equation (2.5) after elementary integration. Equation 
(2.11) follows from (2.10) after substitution of y/y and partial integration. □ 

2.4. Comparison of the Canonical LD Measures with Commonly Used LD 
Measures 

In this section we compare the canonical LD measures, especially 771 and 77 1 , with established measures 
of LD represented by its most common represantatives D' , r and Q. We analyse their commonalities, 
differences and behaviour in special situations with the help of seven remarks. 



Remark 1: All LD measures based on the odds ratio A, such as rja (for a > 0) and Q, are strictly 
monotone functions of each other. Figures 2 illustrates these functional relations. 



Remark 2: The measure Q is a good approximation for 772 with maxt^j \Q (t) — 7]2 {t)\ w 0.035. 
Hence, Q is approximately uniformly distributed under 2? (2). Optimal agreement of Q and j]a is 
obtained for a « 1.77 with maxjgT \Q {t) — rja {t)\ sa 0.013. 



Remark 3: The LD measure r needs a diagonal structure to approach the extremal values of ±1 



while D' , Q and j^q, (for a > 0) could become extremal whenever one table entry tends to zero. Thus 
for tables with _D', Q or r/ct near 1, r may assume values in (0, 1). Compare Figure 3. 



Remark 4: The LD measures D' and r are not selection invariant. 



Proof: We prove this by an example. Let ti = 5(13) ^^id t'l = j^{W)- It holds that ti ^ t2 
with ji = V = i. One calculates that D' (ti) — r (ti) ~ i while D' {t2) = r {12) = |. On the other 
hand A(ti) = A(i2) = 9. □ 
It is not surprising that measures depending on D arc not selection-invariant since the concept of D 
is based on constant marginals while selection may change these. 



Remark 5: For every e > there is a table G T such that \D' {t^)\ < e and \{tg) > i. Hence, 
there are tables for which D' measures almost no LD and A, Q and measure almost perfect LD. 



Proof: Consider for example 



t6 



V 



for (5 > sufficiently small. Then it follows that lim^^o D' {ts) ~ and lim^^o X {ts) — 00. □ 
Note however that this abnormality only occurs when three table entries tend to zero. Compare Figure 
4. 



Remark 6: For tables with not too imbalanced marginals, monotonicity between D' and odds ratio 
based measures is essentially preserved. Tabic 1 shows the Kendall correlation coefficient determined 
for tables with specified marginals. 



Remark 7: On a set of tables with constant marginals, poo is uniformly distributed under 



Thus after standardisation with Dmax, D' is uniformly distributed on (—1,1) and has maximum en- 
tropy in this case. By construction the canonical LD measures rji and 771 are uniformly distributed 
when paired with their respective Dirichlet distribution. Compare Figure 5. 



3. The Estimation Problem for LD Measures 

We now investigate various estimators for the theoretical linkage disequilibrium measures discussed 
above. Here, we do not address the problem that the entries sometimes must also be estimated 
from real data by a phasing algorithm in case of double heterozygote marker s . Thi s can be done for 
example with the help of the exact solution of an EM-algorithm (see IWeiii (jl996[ ) for details) . At 



this step it is necessary to assume haploid populations or polyploid populations in Hardy- Weinberg 
equilibrium. 

Estimation is based on observed contingency tables <Ar = ^ with G N and ■ riij = A^. 

The table ^at is regarded as a random realization of the true table t d T under the corresponding 
tetranomial distribution with sample size N. The tables form a sample space Tjv of all possible 
realizations of t after A^-fold sampling. In the following, we will define and compare different families 
of consistent estimators for linkage disequilibrium measures given an observation In G Tn- 



3.1. Estimators for LD Measures 

The common approach is the use of "plug- in" estimates, where the probability estimates pij of pij 
are inserted into the theoretical formula of a measure. Often, the frequency maximum-likelihood 
estimates ^ of pij are used. However this may le ad to inflated or u ndefined estimates of the desired 



quantity especially in case of small sample sizes (jTear et al 



2OO4). This approach has been used 



both extensively and carelessly in the literature to estimate for example D' and r. We will denote 
corresponding estimators as naive estimators NE. 



For any LD measure M it reads 



M^^(<) = M{t) with i 



"00 "01 

N N 



"10 "11 

N N 



(3.1) 



An a lternative approach is using " non- informative" Bayesian probabihty estimates for pij (jWallev 



mm 



P^j 



a + Tin , ( 1 

— ■ — ^ with ae <-,! 
Aa + N 12' 



(3.2) 



which is the expectation of the posteriori distribution V (a + n) were n = (noo, ".oii '^■107 nn). Calcu- 
lating M with the help of pij instead of pij yields a consistent semi-naive estimator SNE. It has the 
form 



M^^^(t) = M{t} with i 



^ Poo Poi ^ 



V 



(3.3) 



PlO Pll 

A further approach is to calculate the expectation of M under a posteriori distribution which is the 
ordinary Bayes estimator BE 



Rr^ (t) 



fv{a+n)M{t) dt 



1 



1 /•1-poo /-l-poo-poi 



aoo+«oo — l„Qoi+"oi — l„aio+"io — 1 

Poo Poi PlO 



B{a + n) Jo JO JO 
(1 - Poo - Poi - Pio)""^""~^ M (t) dpao dpoi dpio 



(3.4) 



Finally, some LD measures can be e stimated with a so-called volume formula. The concept of vol- 
ume measures can be traced back to 



HotellingI (|l939l ) and has bee n applied to conti ngency tables by 



Chen et al 



(|2006l ). In the simplest 



Diaconis fc EfronI (|l985f ) and to linkage disequilibrium measures byL 
case, the idea is to count the number of tables which are less "extreme" than an observed table and 
compare th i s "vol ume" with the total volume of all possible tables. 



Chen et al. 



( 20061 ) defined Dvol as the number of tables with fixed marginals, fixed sign of D and 
less extreme values of D divided by the number of tables with fixed marginals and fixed sign of D. 
In the original definition, this measure is always greater or equal to and less than 1. For better 



comparability with the other estimators, we consider an obvious signed version D'y^ with values in the 
interval (—1, 1) by assigning the sign of D o f the observed ta ble t^- The claimed advantage of this 



estimating procedure is that D'j^^ is inflated (jTear et al 



20021 ) while Dyj?, is not in case of tables with 



Chen et al 



20061 ) 



small numbers where the occurrence of zeros is likely 
Generalising this approach, we define an estimation function t)^^ for yy^. Any Dirichlet distribution 
I? (a) induces a probability distribution Wa on the sample space T^r and thus a discrete probability 
distribution of the corresponding odds ratios A (f). We define fj^^ {In) in analogy to equation (2.2) as 
the probability under T> (a) to obtain an odds ratio less extreme than A (tw) standardised to (—1, 1). 
By construction, the function L in (2.2) can be interpreted as the "volume" of tables with smaller odds 
ratio than A divided by the total volume of all tables in T^r. 

Note that Tat contains all tables with fixed sum of entries but not fixed marginals as in the defi- 
nition of D'y^. 

This definition can be used to construct another estimation function for rj directly from the observed 
contingency table t^r. In contrast to the construction of Dy^ we will not assume that all alternative 
tables ijv G Tjv are equally likely. Therefore, we calculate the probability of a single table In given a 
Dirichlet distribution. 



Theorem 5: Let Wa {tN) be the probability of the table In G Tjv under the distribution T) {a) 
on Tat, then we have 



where we define nB (n, x) = 1 for n = 0, x > and B is the Beta-function 



Wo.{tN) - N ^ (3.5) 



Proof: Let M {n,p) be the multinomial distribution of riij under the probabilities pij, then 

Wa{tN) = prob(tAr | I? (a)) 

= M in,p) fT,(a)dp 
Jt 



It 

/■I /■!— Poo (-l— Poo— Poi 

C/ / / Poq'Pqi'Pw" i'^-Pm-Poi-Pw)"'' dpiodpoidp 



Jo Jo 



where C = "STT^T (rf' ) ^'^'^ ~ ""y + ~ ^- Using the identity 



V 







it follows that 

Wa (iw) = CB {nio + 1, 7111 + 1)B {fioi + 1, nio + "ii + 2) B (rioo + 1, nio + "oi + "ii 
= C- 



Rewriting the last equation yields equation (3.5). 



Remark: Obviously, for the Dirichlet distribution V {1} we have w = constant 



With the last theorem, the volume estimator for 77 can be defined. Let 
(noo + aoo) («ii + an) 



A(ijv) 



(rioi + aoi) (niQ + aio) 



be the semi-naive estimator for the odds ratio, then, we define 

TV N-iN-i-j 

UA(0) = X! ^(«'J'^'^^«^J^^)x(A(t»j,fc,Af-i-j-fc),A(t7v) 



1=0 j=0 k=0 

with <ij,fc,; 



where the indicator function x has the form 

1 : Ai < A2 
I : Ai = A2 
: else 



x(Ai, A2) 



And finally 



(3.7) 



3.2. Comparison of Estimators 

We compared these estimation functions in a simulation study. First, we simulate true tables by 
random drawing from specified Dirichlet distributions. From it, true values of the linkage disequilibrium 
measures can be calculated. In the next step, we construct a concrete realization of the true tables by 
random drawing from the corresponding multinomial distribution with different sample sizes N . The 
estimation functions are compared with respect to their expected mean square error. 

The analysis is performed for yyi, r/i , I?', r, Q and corresponding estimation functions. Results 
can be found in table 2. 

Because of different variances of the true measures, we can only compare different estimators for 
one and the same LD measure. However, the results for 77 are comparable to those for D' and Q. 
Looking at the results presented in table 2 we can summarize the following observations. 

Observation 1: For all scenarios and measures, the naive estimator has the highest mean square 
error. 

Observation 2: Semi-naive and Bayes estimators perform almost equally well for all measures. How- 
ever, as expected the Bayes estimator performs best if the defining distribution equals the sampling 
distribution of the tables. The semi-naive estimator is robust against variation of the sampling distri- 
bution. 

Observation 3: The volume estimator for D' is better than the naive estimator but worse than 
the semi-naive estimators. It is especially worse in case of sampling distribution T) (i) were small 



entries of the tables are likely. 



Observation 4: The volume estimator performs comparable to the semi-naive and Bayes estima- 
tor for both rji and r]i . 

Summarising these results, we suggest to use one of the semi- naive estimators to estimate all link- 
age disequilibrium measures considered. The Bayes estimators are not better but arc computationally 
more expensive. The same holds true for the volume estimators. Moreover, for D' the volume esti- 
mator is clearly outperformed by the semi-naive estimators if the occurrence of small table entries is 
likely. 

3.3. Numerical Issues 

The analytic solutions (2.6) and (2.11) cause numerical problems, because of the singularity for A = 1 
in combination with the differences in the numerator. Hence, in the neighbourhood of A = 1 it is useful 
to replace the analytic formula by the corresponding Taylor series. After some calculation one finds 
that 

Vi{l + e) = ^{2e-e^)+0{e^) (3.8) 
mjl + e) = ^{2s~e')+0{e') (3.9) 

where O is the first Landau order symbol. 

The calculation of the Bayes estimator of rj is also computationally complicate. We suggest to use 
Monte-Carlo integration in combination with a quick sampling tool for Dirichlet distributions. The 
calculation of the volume estimator for rj is computationally expensive as well if the number of haplo- 
types is high, since computational effort rises with O (N^)- 



Algorithms of all methods have been implemented in the statistical software R (jlhaka fc Gentleman 



19961 ). We will provide the scripts upon request. 



4. Discussion 



In this paper we proposed and justified six postulates for a canonical measure of (allelic) association 
(linkage disequilibrium) intended for application to one-sample two by two contingency tables T: The 
measure is a mapping of T to the set of real numbers. It should be zero in case of independence and 
extremal if one of the entries approaches zero while the marginals are positively bounded. It should 
reflect the symmetry group of two by two tables and be invariant under certain transformations of the 
marginals (selection invariant). Their scale should be maximally discriminative for arbitrary tables 
relative to a calibrating (symmetric) distribution on the manifold of two by two probability tables. 
We proved that there is a unique canonical LD measure for each choice of a calibrating symmetric 
distribution on T. This calibrating distribution specifies an easy-to-intcrpret scale essentially based on 
the fraction of tables exhibiting a less extreme odds ratio than the given one. Although we will use 
Bayesian and empirical Baycsian considerations in the following in order to motivate the choice of the 
calibrating distribution, it is only nice but not necessary that the calibrating distribution is a proper 
Bayesian prior for data at hand. 

The canonical LD measures have maximum entropy relative to their defining calibrating distribution. 
The principle of ma ximum entropy clas si fiers is not new and has been applied to several areas of 



interest (for example 



Nigam et al 



(19991); 



Zhuetal 



( 20051 1). However, to our knowledge there is no 



application of maximum entropy classifiers to the problem of association measures of two by two con- 
tingency tables. 

Theoretical and empirical arguments support the choice of I? (^) as calibrating distri bution. Tp lh) i s 



19611) 



Jeffreys' non-informative prior on T derived from an information invariance principle (jJeffrevs 
2? (^) induces the uniform distribution on the marginal frequencies and is weakly informative concern- 
ing the odds ratio (confer Figure 1). Empirically in our experience, SNP-array data often exhibit a 
rather uniform distribution of minor allele frequencies when disregarding extremely rare SNPs (confer 
Figure 6). Consequently, 771 tends to have a roughly uniform distribution when calculating pair- wise 



LD in a small region of the genome (confer Figure 7) . Thus rji can also be interpreted in an empirical 
Bayesian way as the fraction of tables in the analysed data exhibiting a less extreme odds ratio than 
the given one. 

Hence, for applications in SNP data we recommend the use of iji. In situations where most tables 
have less imbalanced marginals, Q (corresponding to 772) is a reasonable alternative. 
The popular measures D' and r are not selection invariant. D' is motivated by a biological model 
of human evolution and genomic structure which is not in the focus of our biomctrical point of view 



(Morton et al 



2001 



Shete 



20031 ). Selection invariancc is particularly important if one wants to com- 
pare LD between pairs of SNPs across the genome or across different populations. In this case one 
needs a measure of association that can be compared between tables with markedly different marginal 
distributions (allele frequencies). 

The measure r (and All) is extremal when a diagonal of the table tends to zero. The canonical measure 
is extremal when there is one zero in the table because an emerging single SNP gives rise to a table 
with one zero. On the other hand, measures which arc extremal only for tables with a diagonal zero 
are pertinent when measuring the degree of redundancy between single markers. 
We sharply distinguish between the definition of a LD me asure and its estimation. For D' t he estima - 
tion problem has been considered recently and partly by 
investigated jackknife and bootstrap estimators for D' . 



Sebastiani fc Abad-Graul j2007t ) 



The usual naiv e plug-in estimators based on fre quencies can lead to unreliable estimates (jChen et al 



2006 



Lo 



1991 



Sebastiani fc Abad-Grau 



20071) . Estimation functions based on the computationally 
expensive volume measures (|Chen et al.l . 12006 ) were proposed recently as a remedy to this well-known 
problem. 

Here, we investigated four different consistent estimation functions for the measures 77, D' , r and Q, 
the naive estimator, the semi-naive estimator, the Baycs estimator and the volume estimator (for 77, 
D' on ly) and c ompared them in an extensive simulation study based on the expected mean square 



error ( Ld . 



19911) . 



We confirmed that volume estimators have better expected mean square error than the naive plug-in 
estimators. In the case of D' , volume estimation perform worse than the semi-naive estimator partic- 
ularly for the sampling distribution f (5)- The reason is that the volume definition for D' is based 
on tables with fixed marginals. Implicitly the marginals are treated as certain but they are in fact 
random. In contrast, our volume estimator for 77 treats the marginals as random and its accuracy is 
reasonable. 

In our study the scmi-naivc estimator outperforms the naive estimator with respect to accuracy and 
the volume and Bayesian estimators with respect to computational cost. 

In summary we propose a canonical measure rji for analysing linkage disequilibrium in the one-sample 
case. The canonical measure is uniquely characterised by a set of six biometrical postulates. It is easy 
to interpret and can be economically calculated and estimated by the semi-naive estimator using R 
functions which we will be glad to provide on request. 
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Figure 1: Density of the log odds ratio under diflerent Dirichlet distributions 




Figure 2: Relation between odds ratio based measures of LD 




Figure 3: Correlation of r with other measures of LD (100,000 simulations from 




Figure 4: Correlation of D' with odds ratio based measures of LD (100,000 simulations from 
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Figure 5: Distribution of various LD measures assuming a Dirichlet distribution f (^) or P (1) on T. 
By construction, the canonical LD measures rji and rji are uniformly distributed for T) (i) and T) (1) 
respectively. D' is uniformly distributed for T> (1). 
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Figure 6: Distribution of minor alle lic frequency in the CEU HapMap sample genotyped with Mapping 



500k Affymctrix chipset (jHapmapl ). 
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Figure 7: Distribution of Q and 771 for all pairs of markers of chromosome 22 of the CEU HapMap 
sample. We selected pairs of SNPs with minor allelic frequency greater than 10% and distances less 
than 50kb. Semi-naive estimators using a — ^ were calculated. It reveals that rii differentiates 
considerably better between tables than Q. 



Tables 



Minor marginal frequencies 


Kendall's r 


0% to 10% 


0.873 


10% to 20% 


0.905 


20% to 30% 


0.916 


30% to 40% 


0.930 


40% to 50% 


0.957 



Table 1: Kendall's correlation coefficient between D' and A for tables with specified marginal fre- 
quencies (based on a sample of = 1, 000, 000 drawn from V (^)). 



Estimator 


^ \ 2 ' 2 ' 2 ' 2 / 


0(1,1,1,1) 


0(2,2,2,2) 


iV = 50 


iV = 100 


iV = 500 


iV = 50 


N = 100 


iV = 500 


N = 50 


N = 100 


iV = 500 


'11 


0.127 


0.084 


0.027 


0.083 


0.039 


0.0064 


0.050 


0.022 


0.0040 


f^SNE 


0.070 


0.047 


0.016 


0.040 


0.022 


0.0052 


0.033 


0.018 


0.0038 




0.063 


0.042 


0.014 


0.040 


0.022 


0.0051 


0.034 


0.018 


0.0038 




0.067 


0.046 


0.016 


0.041 


0.023 


0.0052 


0.037 


0.019 


0.0039 


^NE 


0.140 


0.093 


0.0304 


0.086 


0.039 


0.0056 


0.038 


0.0134 


0.0019 


~SNE 


0.040 


0.027 


0.0095 


0.022 


0.012 


0.0029 


0.017 


0.0088 


0.0018 


2 


0.036 


0.024 


0.0085 


0.025 


0.014 


0.0031 


0.020 


0.0097 


0.0018 


vF 

2 


0.040 


0.027 


0.0094 


0.029 


0.015 


0.0033 


0.025 


0.0116 


0.0020 




0.115 


0.074 


0.023 


0.081 


0.039 


0.0072 


0.053 


0.024 


0.0045 


D'sNE.l 


0.080 


0.053 


0.017 


0.048 


0.027 


0.0065 


0.037 


0.020 


0.0043 


SNE 1 


0.067 


0.045 


0.015 


0.050 


0.028 


0.0065 


0.041 


0.021 


0.0044 


D'be.i 


0.072 


0.047 


0.015 


0.047 


0.027 


0.0064 


0.037 


0.020 


0.0044 


D'beI 

2 


0.064 


0.042 


0.014 


0.051 


0.028 


0.0065 


0.043 


0.022 


0.0044 


D'VE 


0.114 


0.069 


0.021 


0.058 


0.031 


0.0067 


0.038 


0.020 


0.0044 


T 7\rir 
' 1\E 


0.017 


0.0087 


0.0019 


0.017 


0.0085 


0.0017 


0.018 


0.0090 


0.0018 




0.015 


0.0082 


0.0018 


0.015 


0.0078 


0.0017 


0.016 


0.0084 


0.0018 


^ SNE — 


0.014 


0.0078 


0.0018 


0.016 


0.0080 


0.0017 


0.017 


0.0086 


0.0018 


f-BE.l 


0.015 


0.0082 


0.0018 


0.015 


0.0078 


0.0017 


0.016 


0.0084 


0.0018 


^BE.^ 


0.014 


0.0078 


0.0018 


0.015 


0.0079 


0.0017 


0.017 


0.0086 


0.0018 


Q NE 


0.135 


0.090 


0.029 


0.095 


0.047 


0.0085 


0.071 


0.034 


0.0065 


QSNE.I 


0.087 


0.059 


0.020 


0.059 


0.033 


0.0076 


0.055 


0.030 


0.0064 


Q SNE.^ 


0.076 


0.051 


0.017 


0.061 


0.034 


0.0077 


0.060 


0.031 


0.0064 


Qbe.1 


0.080 


0.053 


0.018 


0.058 


0.033 


0.0076 


0.055 


0.030 


0.0064 


Qbe.^ 


0.072 


0.047 


0.016 


0.062 


0.034 


0.0077 


0.059 


0.031 


0.0064 



Table 2: The expected mean square error for different estimators of different LD measures based 
on f 00,000 simulations of true tables drawn from the Dirichlet distribution in the columns and their 
realizations with sample size N. The estimators are explained in the text. Except for the canonical 
measures, we calculate the semi-naive and Bayesian estimators for both a = ^ and a = 1 as well. 



